327 research outputs found

    Shallow Text Clustering Does Not Mean Weak Topics: How Topic Identification Can Leverage Bigram Features

    DMNLP, co-located with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD). Text clustering and topic learning are two closely related tasks. In this paper, we show that topics can be learnt without requiring an exact categorization. In particular, experiments on two real case studies with a vocabulary based on bigram features yield readable topics that cover most of the documents. Precision at 10 reaches up to 74% for a dataset of scientific abstracts with 10,000 features, which is 4% lower than with unigrams only, but the resulting topics are more interpretable.

    Mesurer la proximité entre corpus par de nouveaux méta-descripteurs (Measuring the proximity between corpora with new meta-descriptors)

    Given the number of existing classification algorithms, finding the one best suited to classifying a corpus of documents is a difficult task. Meta-classification now appears very useful for determining, on the basis of past experiments, which algorithm should be the most relevant for a given corpus. The underlying idea is that "if an algorithm has proved particularly well suited to one corpus, it should behave the same way on a sufficiently similar corpus". In this article, we propose new meta-descriptors based on notions of similarity to improve the meta-classification step. Experiments conducted on various real datasets show the relevance of our new descriptors. (Author's abstract)

    SuMGra: Querying Multigraphs via Efficient Indexing

    Many real-world datasets can be represented by a network whose nodes are interconnected by multiple relations. Such a rich graph is called a multigraph. Unfortunately, existing algorithms for subgraph query matching cannot adequately leverage the multiple relationships that exist between nodes. In this paper we propose an efficient indexing schema for querying single large multigraphs, where the indexing schema aptly captures the neighbourhood structure in the data graph. Our proposal, SuMGra, couples this novel indexing schema with a subgraph search algorithm to quickly traverse through the solution space and enumerate all the matchings. Extensive experiments conducted on real benchmarks demonstrate the time efficiency as well as the scalability of SuMGra.

    IFO2: A uniform approach for information system modelling

    This paper is devoted to the IFO2 conceptual model, an extension of the semantic IFO model defined by S. Abiteboul and R. Hull. Its originalities are a uniform approach for both structural and behavioural application specifications, a "whole-object" and "whole-event" approach, the use of constructors to express combinations of objects or events, and the modularity and re-usability of specifications in order to optimize the designer's work. Furthermore, it offers an overview of the modelled system. To complement the modelling part, IFO2 includes a derivation component to perform the implementation of specifications by using an object-oriented or an active DBMS.

    GET_MOVE: An Efficient and Unifying Spatio-Temporal Pattern Mining Algorithm for Moving Objects

    Recent improvements in positioning technology have led to a much wider availability of massive moving object data. A crucial task is to find the moving objects that travel together. Such groups are usually called spatio-temporal patterns. Due to the emergence of many different kinds of spatio-temporal patterns in recent years, different approaches have been proposed to extract them. However, each approach focuses on mining only one specific kind of pattern. Mining and managing patterns is therefore a painstaking task, because of the large number of algorithms involved, and it is also time consuming. Additionally, these algorithms have to be executed again whenever new data are added to the existing database. To address these issues, we first redefine spatio-temporal patterns in the itemset context. Secondly, we propose a unifying approach, named GeT Move, using a frequent closed itemset-based spatio-temporal pattern-mining algorithm to mine and manage different spatio-temporal patterns. GeT Move is implemented in two versions: GeT Move and Incremental GeT Move. Experiments are performed on real and synthetic datasets, and the experimental results show that our approaches are very effective and outperform existing algorithms in terms of efficiency.

    From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files

    Log files generated by computational systems contain relevant and essential information. In some application areas, such as the design of integrated circuits, log files generated by design tools contain information that can be used in management information systems to evaluate the final products. However, the complexity of such textual data raises challenges for extracting information from log files. Log files are usually multi-source and multi-format, and have a heterogeneous and evolving structure. Moreover, they usually do not respect natural language grammar and structures, even though they are written in English. Classical methods of information extraction, such as terminology extraction methods, are poorly suited to this context. In this paper, we introduce our approach, Exterlog, to extract terminology from log files. We detail how it deals with the specific features of such textual data. Performance is improved by favoring the most relevant terms of the domain through a scoring function that uses a Web- and context-based measure. The experiments show that Exterlog is a well-adapted approach for terminology extraction from log files.

    Querying RDF Data Using A Multigraph-based Approach

    RDF is a standard for the conceptual description of knowledge, and SPARQL is the query language conceived to query RDF data. RDF data is valued and exploited in various domains such as life sciences, the Semantic Web, and social networks. Further, its integration at Web scale compels RDF management engines to deal with queries that are complex in both size and structure. In this paper, we propose AMbER (Attributed Multigraph Based Engine for RDF querying), a novel RDF query engine specifically designed to optimize the computation of complex queries. AMbER leverages subgraph matching techniques and extends them to tackle the SPARQL query problem. First, RDF data is represented as a multigraph, and novel indexing structures are established to efficiently access the information in the multigraph. Finally, a SPARQL query is also represented as a multigraph, and the SPARQL querying problem is reduced to the subgraph homomorphism problem. AMbER exploits structural properties of the query multigraph, as well as the proposed indexes, to tackle the subgraph homomorphism problem. The performance of AMbER, in comparison with state-of-the-art systems, has been extensively evaluated over several RDF benchmarks. The advantages of employing AMbER for complex SPARQL queries have been experimentally validated.

    Node Overlap Removal Algorithms: A Comparative Study

    Appears in the Proceedings of the 27th International Symposium on Graph Drawing and Network Visualization (GD 2019). Many algorithms have been designed to remove node overlapping, and many quality criteria and associated metrics have been proposed to evaluate those algorithms. Unfortunately, no complete comparison of the algorithms based on quality metrics has been provided, so it is difficult for a visualization designer to select the algorithm that best suits their needs. In this paper, we review 21 metrics available in the literature, classify them according to the quality criteria they try to capture, and select a representative one for each class. Based on the selected metrics, we compare 8 node overlap removal algorithms. Our experiment involves 854 synthetic and real-world graphs.

    RetweetPatterns: Detection of Spatio-Temporal Patterns of Retweets

    Social media is strongly present in people's everyday life, and Twitter is one example that stands out. The data within these types of services can be analyzed in order to discover useful knowledge. One interesting approach is to use data mining techniques to reveal hidden behaviours and patterns. The primary focus of this paper is to identify patterns of retweets and to understand how information spreads over time on Twitter. The work adapts the GetMove tool, which is capable of extracting spatio-temporal pattern trajectories, and TweeProfiles, which identifies tweet profiles along several dimensions: spatial, temporal, social and content. We hope that the more flexible clustering strategy of TweeProfiles will enhance the results extracted by GetMove. We apply this mechanism to one case study and develop a visualization tool to interpret the results.

    Mining microarray data to predict the histological grade of a breast cancer

    BACKGROUND: The aim of this study was to develop an original method to extract sets of relevant molecular biomarkers (gene sequences) that can be used for class prediction and can be included as prognostic and predictive tools. MATERIALS AND METHODS: The method is based on sequential patterns used as features for class prediction. We applied it to classify breast cancer tumors according to their histological grade. RESULTS: We obtained very good recall and precision for grade 1 and grade 3 tumors but, like other authors, less satisfactory results for grade 2 tumors. CONCLUSIONS: We demonstrated the value of sequential patterns for class prediction from microarray data, and we now have the material to use them for prognostic and predictive applications.